Introduction to R and Rstudio

Session - Cleaning data with {dplyr}

Zoë Turner
Cartoon image with the word dplyr: go wrangling above. There are two fluffy characters with the bigger called dplyr being ridden by a smaller character with a hat like a cowboy. The cowboy is rounding up three others called data that look less friendly and are being caught with the cowboy's whip

Artwork by @allison_horst

Wrangling

Is the reshaping or transforming of data into a format which is easier to work with

This is often the largest part of many analyses and data science

A note on tidy data

Tidyverse functions work best with tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.

(Broadly, this means long rather than wide tables)

{dplyr} package

  • {dplyr} is a language for data manipulation
  • Most wrangling puzzles can be solved with knowledge of just a few dplyr verbs or functions
  • Many of the concepts of these functions exist in SQL but dplyr (and other packages) can extend this further

6 functions/verbs to start with

6 key verbs will help us gain a deeper understanding of our data sets.

Note summarise() can also be spelt summarize()

dplyr::arrange()
dplyr::filter()
dplyr::mutate()
dplyr::group_by()
# followed often by 
dplyr::ungroup()
dplyr::summarise()

Building with steps

These verbs aren’t used independently of each other.

Each can be a step in the code, like a recipe but can also be repeated.

A recipe starts with:

potato then
peel then
slice into medium sized pieces then
boil for 25 minutes then
mash

Recipe as code

The potato is the object in R terms and the steps are verbs or functions

Take a potato then peel then
slice into medium sized pieces then
boil for 25 minutes then
mash

potato |>
peel() |>
slice(size = "medium" |>
boil(time = 25) |>
mash()

Pipe

Shortcut key Ctrl+Shift+m

You might be familiar with the pipe |> which comes from the {magrittr} and is available through the {tidyverse} but the new pipe |> doesn’t require any packages to run

Screenshot of the Tools/Options wizard in the Code tab from the side and Editing at the top. Use native pipe operator option to select is highlighted.

Q1. Which organisation provided the highest number of Mental Health (MH) beds?

arrange()

Reorder rows based on selected variable

beds_data |> 
  arrange(beds_av)

Descending data

We need descending order:

beds_data |> 
  arrange(desc(beds_av))

desc() works for text and numeric variables

Q2. Which 2 organisations provided the highest number of MH beds in September 2018?

  • We’ll use arrange() as before
  • But we require only observations with this date

filter()

The expression inside brackets should return TRUE or FALSE. We are choosing rows where this expression is TRUE.

beds_data |> 
  filter(date == "2018-09-01") 

A negative test of equality

To exclude and test where the expression is NOT equal !=

beds_data |> 
  filter(date != "2018-09-01") 

Ordered and filtered

beds_data  |> 
  arrange(desc(beds_av)) |> 
  filter(date == "2018-09-01") 

Find the top 2 organisations

This isn’t a key function but useful

beds_data  |> 
  arrange(desc(beds_av)) |> 
  filter(date == "2018-09-01") |> 
  slice_head(n = 2)

Q3. Which 5 organisations had the highest percentage bed occupancy in September 2018?

  • We’ll use arrange() as before
  • We’ll use filter() as before
  • But we don’t have a percentage variable in the data

Create new variables

= in this context is an alias not a test of equality

beds_data |> 
  mutate(perc_occ = occ_av / beds_av) |> 
  filter(date == "2018-09-01") |> 
  arrange(desc(perc_occ)) 

Q4. What was the mean number of beds, (across all trusts) for each value of date?

  • Let’s first look at how we’d produce summary statistics like a mean
  • And then see how this can be applied to groups of data

summarise()

Collapses a single summary value

beds_data |> 
  summarise(mean_beds = mean(beds_av))

Missing values

We’ll need to remove NA values to get a suitable mean. TRUE can also be T

beds_data |> 
  summarise(mean_beds = mean(beds_av,
                             na.rm = TRUE)) 

Have a go!

Instead of mean() use median()

object |> 
  summarise(new_name = function_name(column_name,
                                     na.rm = ???))

Use a sum() statistic twice

object |> 
  summarise(col_1 = function_name(beds_av,
                                  na.rm = ???),
            col_2 = function_name(occ_av),
            na.rm = ???)
)
10:00

Answer for summary statistics

median()

beds_data |> 
  summarise(per_occ = median(beds_av,
                             na.rm = TRUE))
# A tibble: 1 × 1
  per_occ
    <dbl>
1     241

sum()

beds_data |> 
  summarise(total_beds = sum(beds_av, na.rm = TRUE),
            total_occupacy = sum(occ_av, na.rm = TRUE))
# A tibble: 1 × 2
  total_beds total_occupacy
       <dbl>          <dbl>
1     412480         368434

Applying summarise() to groups

  • Now we know how to use summarise (mean)
  • We’ll produce a summary value for each value of date

group_by() - persistent grouping

group_by() does nothing to the output alone.
The change occurs behind the scenes.

beds_data |> 
  group_by(date) 

ungroup()

Cartoon of fuzzy creatures created by Allison Horts with party hats on. Two are together and happy but one is behind holding a present and looking sad. The words.

Seeing ungroup()

head() is the base R version of slice() and is unaffected by group_by()

beds_data |> 
  group_by(date) |> 
  head(1)
beds_data |> 
  group_by(date) |> 
  ungroup() |> 
  slice_head(n = 1)

Group by - in operation

The summarise() function currently relies upon the group_by() function to apply summary statistics to groups/sets but in a new release the reliance will be changed to be function specific. Currently code

beds_data |> 
  group_by(date) |>
  summarise(mean_beds = mean(beds_av, 
                             na.rm = TRUE))

Break?

Option to take this break before an exercise of after

10:00

Q5. Which 5 organisations have the highest mean % bed occupancy?

  • For each of the organisations group_by()
  • Add 2 summary statistics, total_beds/total_occupancy to summarise() the data
  • mutate() the data to create a percentage using the previous
  • Order to find highest by using arrange()

Hint

beds_data |> 
  group_by() |> 
  summarise() |>
  mutate() |> 
  arrange()
10:00

Solution

beds_data |> 
  group_by(org_name) |> 
  summarise(total_beds = sum(beds_av, na.rm = TRUE),
            total_occupancy = sum(occ_av, na.rm = TRUE)) |> 
  mutate(perc_occ = total_occupancy / total_beds) |> 
  arrange(desc(perc_occ))